Sankoff et al. BMC Genomics 2012, 13(Suppl 1):S8 
http://www.biomedcentral.com/1471-2164/13/S1/S8 



PROCEEDINGS Open Access 



A model for biased fractionation after whole 
genome duplication 

David Sankoff, Chunfang Zheng, Baoyong Wang 

From The Tenth Asia Pacific Bioinformatics Conference (APBC 2012) 
Melbourne, Australia. 17-19 January 2012 




Genomics 



Abstract 

Background: Paralog reduction, the loss of duplicate genes after whole genome duplication (WGD) is a pervasive 
process. Whether this loss proceeds gene by gene or through deletion of multi-gene DNA segments is 
controversial, as is the question of fractionation bias, namely whether one homeologous chromosome is more 
vulnerable to gene deletion than the other. 

Results: As a null hypothesis, we first assume deletion events, on either homeolog, excise a geometrically 
distributed number of genes with unknown mean yt, and a number r of these events overlap to produce deleted 
runs of length /. There is a fractionation bias 0 < q> < 1 for deletions to fall on one homeolog rather than the 
other. The parameter r is a random variable with distribution tt(-). We simulate the distribution of run lengths /, as 
well as the underlying tt(-), as a function of fj, (p and 6, the proportion of remaining genes in duplicate form. We 
show how sampling / allows us to estimate jj and <p. The main part of this work is the derivation of a deterministic 
recurrence to calculate each n(r) as a function of jj, q> and 9. 

Conclusions: The recurrence for n provides a deeper mathematical understanding of fractionation process than 
simulations. The parameters (j and cp can be estimated based on run lengths of single-copy regions. 



Background 

Whole genome doubling (WGD) creates two identical 
copies (homeologs) of each chromosome in a genome, 
with identical gene content and gene order. From this 
ensues the wholesale shedding of duplicate genes over 
evolutionary time through random excision - elimination 
of excess DNA - namely the deletion of chromosomal 
segments containing one or more genes, or through 
gene-by gene events such as epigenetic silencing and 
pseudogenization [ 1 -6] . 

When a duplicate gene is lost, it may be lost from one 
copy {homeolog) of a chromosome or the other, but gen- 
erally not both, because of the necessity of conserving 
function. This fractionation creates an interleaving pat- 
tern; the full original gene complement becomes appar- 
ent only by consolidating [5] the two homeologous 
single-copy regions. In most cases, there is a degree of 
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bias, more genes being lost from one of the homeolo- 
gous regions than the other [4-7]. Fractionation is an 
important process in many evolutionary domains, in 
particular the flowering plants, since it results in a gen- 
ome that is highly scrambled with respect to its pre- 
WGD ancestor. For this reason as well, fractionation 
raises a number of interesting and difficult problems for 
comparative genomics. 

The study of fractionation is basically a study of runs, 
that is runs of duplicate genes on two homeologous 
chromosomes alternating with runs of single-copy genes 
on one or both of these chromsomes. Because of the 
way these runs are generated biologically, and because 
they involve two chromosomes evolving in a non-inde- 
pendent way, standard statistical or combinatorial run 
analyses are not directly applicable. 

In this paper, we present a detailed version of the exci- 
sion model of fractionation with geometrically distributed 
deletion lengths, for which we previously analyzed a tract- 
able, but biologically unrealistic, special case [8]. The key 
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problem in this field is to determine \i, the mean of the 
hypothesized geometric distribution since this 

bears directly on the main biological question of the rela- 
tive importance of random excision versus gene-by-gene 
inactivation. The relevant data consist of runs of single- 
copy genes (whose duplicates have been lost from the 
homeologous region) as well as runs of remaining dupli- 
cate pairs in two homeologous regions. The inference of {i 
is complicated since each run of / single copies may have 
been produced by an unknown number r of deletion 
events, either r - I events (the gene-by-gene model) or 1 < 
r <l - 1 (the random excision model), and these r samples 
of the distribution p turn out not to be independent. Thus 
a fundamental aspect of finding and hence p(^, .), is to 
derive n(r), the proportion of runs of single-copy genes 
with r terms, for r = 1, 2, .... 

A further complication arises from the way deletion 
events accumulate into longer runs of single-copy genes. 
The deletion of a certain number of duplicate genes 
may overlap the site of a previous deletion event on the 
same chromosome, but it is blocked by the functional 
constraint (mentioned above) as soon as it starts to 
overlap the site of a previous deletion event on the 
homeologous chromosome. 

Another biologically important question is to deter- 
mine (p, the proportion of deletion events that operate 
on one of the homeologous chromosomes, while a pro- 
portion 1 - (p operates on the other. We explored this 
question at some length in [4], but a detailed mathema- 
tical treatment of the effects of this "fractionation bias" 
remains to be done. 

It is not difficult to simulate the fractionation process, but 
this gives little insight into its mathematical structure. 
Given that it is unlikely for any closed form of n to exist, 
nor for any simple computing formula, our goal here is to 
develop a recurrence for the distribution of n(r) for r = 1, 
2, ... as a function of ft, (p and 6 (the proportion of duplicate 
pairs remaining in the genome versus single-copy genes). 

This work is an attempt at creating a rigorous "null" 
model of duplicate loss, based on parameters (p and 0. 
This should provide a principled basis for developing 
statistical tests on real WGD descendants, to see if the 
geometric excision hypothesis is acceptable and to see if 
fractionation is unbiased or not. We will not explicitly 
investigate the alternative hypothesis of gene-by-gene 
deletion, nor do we take chromosomal rearrangement 
events into account; our task here is simply to set up 
the null statistical model with a view to enabling useful 
statistical tests of hypothesis for this problem. 

The models 

The structure of the data 

The data on paralog reduction are of the form (G, H), 
where G and H are binary sequences indexed by Z, 



satisfying the condition that g(i) + h(i) > 0. This condi- 
tion models the prohibition against deleting both copies 
of a duplicated gene. We may also assume that whatever 
process generated the Os and Is is homogeneous on Z. 

The sequence G + H consists of alternating runs of Is 
and 2s. We denote by p(l), I > 1 the probability distribu- 
tion of length of runs of Is. For any finite interval of Z 
we denote by/(/), / > 1 the empirical frequency distribu- 
tion of length of runs of Is. 

The use of Z instead of a finite interval is consistent with 
our goal of getting to the mathematical essence of the pro- 
cess, without any complicating parameters such as interval 
length. In practice, we use long intervals of at least 
100,000 so that any edge effects will be negligible. See [4,8] 
for ad hoc ways of handling biological scale intervals. 
The deletion events 

Let (p } where 0 < (p < 1, be the fractionation bias. We 
assume a continuous time process, parameter X{t) > 0, 
only to ensure no two events occur at the same time. 

♦ We start (t = 0) with h(i) = g(i) = 1 for all /. 

♦ At any t > 0, consider any i where h(i) - g(i) - 1. 
With probability X{t)dt, a deletion event occurs 
anchored at position i: we choose a positive number 
a according to a geometric variable y with parameter 

i / i \ a ~ l 

II U, i.e., P[y = a] = y(a) = -I 1 ) , a > 1. 

/x V /V 

♦ Then with probability (p we choose to carry out the 
deletion on G; with probability 1 - (p, on H. 

♦ If the deletion is on G we convert g(i) - 0, g(i + 1) 
= 0, g(i + a - 1) = 0 unless a "collision" occurs. 

♦ One type of collision, skippable collision, arises 
when one or more of g(i + 1), g(i + a - 1) is 
already 0. In this case we skip over the existing 0 
values and continue to convert the next available Is 
into 0s, until a total of a Is have been converted, or 
a collision of the second type is encountered. 

♦ The second type of collision, blocking collision, 
arises when one or more of h(i + 1), h(i + a - 1) 
(or a further term if skipping has already occurred 
during this event) is already 0. In this case, further 
conversions of Is to 0s are blocked, starting with the 
first g(x) for which h(x) = 0. 

Skippable collisions are a natural way to model the 
excision process, since deletion of duplicates and the 
subsequent rejoining of the DNA directly before and 
directly after the excised fragment means that this frag- 
ment is no longer "visible" to the deletion process. 
Observationally, however, we know deletion has 
occurred because we have access to the sequence H, 
which retains copies of the deleted terms. Blocking colli- 
sions are a natural way of modeling the constraint 
against deleting single-copy genes. 
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When the deletion event has to skip over previous Os, 
this hides the anchor i and length a of previous deletion 
events. Denote by r the random variable indicating the 
total number of deletion events responsible for a run. 
Then, given r = r, the run length z is distributed as the 
sum of r geometric variables, which would result in a 
negative binomial distribution were these geometric 
variables independent. They are not, however, since 
events with large a tend to group together in runs with 
large r, while an event with small a is more likely to 
constitute by itself a run with r = 1 [8]. 

If we observe G at some point in time, as in the last 
pair of rows of Table 1, all we can observe are the run 
lengths of Os and Is. We cannot observe the a, i or r, 
while t and X(t) are unknown and, as we shall see, only 
mathematical conveniences that are supplanted by 6 in 
our calculations. The parameters about which we wish 
to make statistical inferences are the deletion length dis- 
tribution parameter ft, and the fractionation bias (p since 
it is these quantities that are at the heart of the biologi- 
cal controversies about paralog reduction. This inference 
can only be based on the two observable quantities: the 
run lengths / and the proportion 0 of remaining (unde- 
leted) Is. 

Results 

Simulations to determine n 

We carried out simulations on an interval of Z of 
length 100,000. This enabled us to use a discrete time 
process instead of the continuous time process on Z. 
The "anchors" for the deletion events were chosen at 
random among the currently undeleted genes. The 
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Five deletion events affecting two homeologous chromosomes, leading to 
two runs of single-copy genes. The fourth step illustrates the "skip" process, at 
/' = 5 where the pre-existing deletion is incorporated into a longer run with r 
= 2. The fifth step shows how further deletion (at /' = -1) and the "skip" 
process (to /' = 2) are blocked when a single-copy gene is encountered (/' = -1) 
on the homeologous chromosome. This creates a single-copy run with length 
/ = 7 and r = 3, part on one chromosome, part on the other. Note that r is not 
observable from the genome data. 



remaining steps were carried out as described in the 
previous section and Table 1. Because each simulation 
run samples thousands of deletions, it sufficed to do 
100 runs for each value of the parameters ft and (p 
studied. 

The top row of Figure 1 compares n(r) when 6 - 0.5 
and 6 = 1, for /u = 2, 3, 6, and 11, when q> = 0.5. We 
can see that the number of deletion events contributing 
to a run is somewhat dependent on ft when half of the 
the sequence has been deleted, but is strongly depen- 
dent when 90% has been deleted. In the bottom row, 
the graph on the left shows that run length 1 is distribu- 
ted very differently for p = 2, 3, 6 and \a = 11, when the 
proportion of the sequence deleted is exactly the same. 
This strongly suggests that observing the run length dis- 
tribution and the overall proportion of deletions should 
allow us to infer ^. Moreover the shape of these distri- 
butions is sensitive to q>. 

We mention that any edge effects in our simulation 
are negligible. Whether we work with G and H on an 
interval of Z of length 100,000 or, as previously [8], 
length 300,000, gives virtually the same results. 

Figure 2 shows the relationship, for three values of the 
fractionation bias (p and for a range of values of ft, 
between the proportion of genes deleted, on one chro- 
mosome or the other, and the average run length. This 
confirms that average run length and overall proportion 
of deletion 0, both observable, can be used to infer [A 
rather accurately, and to infer (p, perhaps with somewhat 
less precision. The latter parameter can, however, be 
inferred from the shape of the run length distribution in 
Figure 1 (bottom) or estimated directly from the propor- 
tion of single-copy genes on each homolog. 

A recurrence for n(r) 

We are interested in inferring ^ from the observed dis- 
tribution of run lengths and the proportion 6 of unde- 
leted terms, i.e., undeleted genes. At the outset 0=1. 
As t — > oo, 0 0. We are not, however, interested in t, 
since it is not observable and any time-based inference 
we can make about [i will depend only on run lengths 
and 6 in any case. On the other hand, r, the number of 
deletion events per run is an interesting variable since 
we can assume run length is close to 77/ on average, at 
least for small values of 6, and we can model the evolu- 
tion of r directly We consider the distribution ji as a 
function of (p and 

As 71 changes, probability weight is redistributed 
among several types of run: 

1. new runs (r = 1) falling completely within an 
existing run of undeleted terms, not touching the 
preceding or following run of deleted terms, type A 
in Figure 3, 
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Figure 1 Simulations of events per run and run length. Distribution of number of deletion events r composing each run when 1 - 6, the 
proportion of sequence deleted, is 0.5 (top left) and 0.9 (top right). <p = 0.5 in both cases. Distribution of run length for for (p = 0.5 (bottom left) 
and (p = 1 (bottom right). For visibility, all diagrams show highest frequency parts of the distribution only. 



2. runs that touch, overlap or entirely engulf exactly 
one previous run of deleted terms with r > 1, thus 
lengthening that run to r + 1 events, types B and C 
in Figure 3, 

3. runs that touch, overlap or engulf, by the skipping 
process, two previous runs of r x and r 2 events 
respectively, creating a new run of r\ + r 2 + 1 events, 
and diminishing the total number of runs by 1, 
including types D and E in Figure 3, 

4. runs that touch, overlap or engulf, by the skipping 
process, k > 2 previous runs of of r lf r k events respec- 
tively, creating a new run of r± + ... + f> + 1 events, and 
diminishing the total number of runs by k - 1, not illu- 
strated in Figure 3. Case 3 above may be considered a 
special case of this for k =2 and Case 2 for k = 1. 

The first process, involving a deletion event of length 
a requires a run of undeleted terms of at least a + 2. 



What can we say about runs of undeleted terms? We 
know that runs of deleted terms alternate with runs of 
undeleted terms, so that there is one run of the former 
for each of the latter. The mean lengths u and v of the 
deleted runs and the undeleted runs, respectively, should 
satisfy: 



The distribution p(l) of lengths of the undeleted runs 
is assumed to be geometric. Similarly the lengths of suc- 
cessive undeleted runs (indeed all undeleted runs) are 
assumed to be independent. While we do not have a rig- 
orous proof of these assumptions, they have been con- 
firmed by extensive simulations. 

Let (pi and (p 2 be the proportion of deletion events 
affecting homeologous chromosomes 1 and 2, respec- 
tively, so that (pi + (p 2 - 1. Let r(r) be the proportion of 
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Figure 2 Dependance of run length on deletion parameters. Average length of run of single copy genes in for <p = 0.5, 0.75, 1.0, for fj = 2, 
3, 6, 11. 



runs of single-copy genes with terms in both chromo- 
somes. (r(l) = 0 and, initially, r(r) = 0 for r = 2, 3, ....) 
Note that in such a run, the term(s) at the extreme left 
were (was) deleted from chromosome i with probability 
(pi and the same for the terms at the extreme right. 

The proportion of undeleted terms in runs of length / 
is lp(l)/E p , where E p = Z />0 ^p(0- As depicted in Figure 
3, the probabilities Pa 1 and Pa 2 that a deletion event 
affects chromosomes 1 or 2, respectively, and falls 
within a run of undeleted terms of length / without 
deleting the terms at either end is, for i e {1, 2} 

l>2 j-2 ' a=l 

= ;rE^)Ei>) (2) 

p l>2 j=2 a=l 

, 1-2 

^ l>2 a=l 

where / indexes the starting position of the deletion 
within the run, and a is the number of terms deleted in 
the event. We define the contribution to mean run 
length of A events to be 

2 1-2 

Ma = E^rE^WE^ - ^ - l)y{a)a. (3) 

i=l p l>2 a=l 

Events of type A t create runs of deleted terms with r = 
1 from one chromosome only. Note that the last line of 



equation (2), and equation (3), involve the collection of 
terms, reducing the number of nested summations in 
order to speed up calculation. While these are not 
lengthy calculations to start with, we display the speed- 
up as a simple illustration of the important efficiencies 
implemented for more difficult cases to be treated 
below. 

The probability pB if that a deletion event on chromo- 
some i touches only the run of deletions on chromo- 
some / on the left of the run of undeleted terms is, for i 
g {1, 2} and fe {1, 2}, 



l-i 



(4) 



a=l 



We define the contribution to mean run length of B 
events to be 



Mb 



EE ^ Ep(/)Ey(a)a 

M/-1 



l-l 



(5) 



l>i 



Events of type B a turn a deleted run with r events 
from one chromosome, into a run with r + 1 events. 
Events of type B t p with i * f, turn a deleted run with r 
events, into a run with r + 1 events. 

The probability pc ti that a deletion event, on either 
chromosome, does not touch the run of deletions on 
the left, does touch or overlap the run of deletions on 
the right entirely on the same chromosome (homeolog), 
but does not extend over the entire run of undeleted 
terms beyond that is, for i e {1, 2}: 
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Figure 3 Types of event. Types of deletion event affecting less than three pre-existing runs. Red and blue shading distinguishes between 
deletions from the two homeologous chromosomes. Grey areas represent previous deletions from either chromosome. White area indicates run 
of undeleted terms. Lightly shaded area indicates run of previously deleted terms. Darker area represents current deletion event. Hatched striped 
area above lightly shaded area indicates either previous deletions from both homeologous chromosomes, or only from the homeolog not 
affected by the current deletion. A: creates one new run with r = 1. B: lengthens left hand run to r + 1 events. C: lengthens right hand run to r 
+ 1 events. D and E: merge two runs to create a single run with r + s + 1 deletion events. 



PQ, = ^^££p(DpM££ Via) 

9 V>\ k>\ j=2 a=l-j+l 

■ ^EE*« (6) 

p i>i fe>i v ' 

/min[!-2,fe-l] mw[i-U] i+fe-2 \ 

£ «y(a)+ £ mm[l-l,k]y{a) + ^ (l + k — a- l)y(a) J . 

\ <«ataH-U] fl =max[/,fe + l] / 

We define the contribution to mean run length of Cu 
events to be 

Mc„ = E M F^ i ££^)PW£ £ Y{a)a, (7) 

1=1 p />1 fe>l j=2 a=l-j+l 

which can be calculated using an expansion such as 
that in (6). Events of type Cu turn a deleted run with r 
events from one chromosome, into a run with r + 1 
events. 

The probability Pc if that a deletion event, on either 
chromosome, does not touch the run of deletions on 
the left but does touch the run of deletions on the right, 
partly or entirely on the other chromosome, is, for i * f 



e {1, 2}: 

(Pi! + 0*0/(1 - r) ^ J-y A 

pc if = j yW- ( g ) 

P />1 j=2 a=/-j+l 

We define the contribution to mean run length of Qy 
events to be 

^•i ^^-^ mio-i*!) i yw. (9) 

j#=l 9 l>\ j=2 fl=I-j+l 

Events of type Cy, with / * / turn a deleted run with r 
events, into a run with r + 1 events. Note that (9) does 
not contains terms of form ay(a) as do (3,5,7), since in 
this event deletion is blocked beyond the existing run of 
deletions; the probability weight is thus concentrated on 
deletions of lesser length. 

The probability pD iti that a deletion event completely 
overlaps the run of deletions on the right and touches 
or overlaps the run of deletions beyond that, all on the 
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same chromosome, but does not extend over a further 
run of undeleted terms: 



(10) 



( £ 



(a-fe)y(fl)+ £ min[I-l,h]y(<i) + £ (I + fc + h- a- 



in which the reduction of the number of nested sum- 
mations is key to the computability of the entire 
calculation. 

We define the contribution to mean run length of D Ui 
events to be 



/ l-j+k+h 



EEE^cwiE E rW a ' (ii) 



/>l fe>l h>l 



j=2 a=l-j+k+l 



which can be calculated using an expansion such as 
that in (10). Events of type Dm turn two deleted runs 
with r and s events, respectively, both from the same 
chromosome, into a run with r + 5 + 1 events. 

The probability pD iif that a deletion event completely 
overlaps the run of deletions on the right, on the same 
chromosome, and touches the run of deletions beyond 
that, partly or entirely on the other chromosome, is: 

Pd „ t ^- 2(1 - t)t ;^ (1 - t)2 ee^we e (12) 

p 1>1 k>\ j=2 a=l-j+k+l 

and the contribution to mean run length is 



0, 2 (1-t)t + </>,%(! -t) 2 



EE'°Wp(fe)E( | -i + fe+1 ) E K4 (13) 



Events of type D U p with z * f, turn two deleted runs 
with r and 5 events, respectively, with the latter contain- 
ing terms from both chromosomes, into a single run 
with r + 5 + 1 events. 

The probability pE iti that a deletion event touches the 
run of deletions on the left of the run of undeleted 
terms and touches or overlaps the run of deletions on 
the right, all on the same chromosome, but does not 
extend over the entire run of undeleted terms beyond 
that is: 



pEui = 

where 



Z>1 fe>l 



0, 3 (1 - r) 



a=l 



l+k-1 



EE^Mfe)EH«K as) 



/>i fe>i 



a=l 



The probability P% that a deletion event touches the 
run of deletions on the left of the run of undeleted 
terms, both from the same chromosome, and touches 
the run of deletions on the right, partly or entirely on 



the other chromosome, is: 

0fr + 0 x %(l-r) 
pEuf = i 

and 



£>(/)]T>(fl) (16) 



1>1 a=l 



Me, 



,- *"** ( '-° S>mif>M. a?) 



/>1 



The probability p% that a deletion event touches the 
run of deletions on the left of the run of undeleted 
terms and touches or overlaps the run of deletions on 
the right, all on the same chromosome, but does not 
extend over the entire run of undeleted terms beyond 
that is: 



pE lfi = 



and 



#0/(1 " r) 



i+k-i 



£2>(l)p(fe) £ Y{a) (18) 



Z>1 fe>l 



1+fe-l 



/>i fe>i 



The probability P% that a deletion event touches the 
run of deletions on the left of the run of undeleted 
terms and touches or overlaps the run of deletions on 
the right, all on the same chromosome, but does not 
extend over the entire run of undeleted terms beyond 
that is: 



pE lff = 

and 



0i0/T +0f0?(l - t) 



hf) 1>1 a=l 



(21) 



Events of type Em turn two deleted runs with r and 5 
events, respectively, all from one chromosome, into a 
single run with r + 5 + 1 events. Events of type E U f t 
and Eiff„ with i * f, turn two deleted runs with r and 5 
events, respectively, into a single run with r + 5 + 1 
events. 

We reiterate here that the last lines of each of (2), (6) 
and (10) include the collection of terms, significantly 
cutting down on computing time when these formulae 
are implemented, especially in the case of (10). 

In this initial model, we neglect the merger of three or 
more runs of deletions. There is no conceptual difficulty 
in including three or more mergers, but the proliferation 
of embedded summations would require excessive 
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computation. Thus we should expect the model to be 
adequate until 0 gets very small, when mergers of sev- 
eral runs at a time become common. 

Let Pa = Pa 1 + Pa 2 , and similarly let each of p B , p E be 
the sums of their respective subscripted terms (with all 
combinations of i and j). We define the change S n {r) in 
the number of runs of deleted terms with r = 1, 2, .... 



^(1) = Pa - [pB +pc + 2p D + 2p E );r(l). 



(22) 



S n {2) = (p B + pcW) - (p B +Pc + 2pD + 2p£)7i(2). (23) 
For r > 2, 

r-2 

8*{r) = (p B+ p c )n(r-l) + (2p D+ 2p E ) n{s)n(r - s - 1) - (p B + p c + 2p D + 2p E )jt(r). (24) 

s=l 

In an implementation on a finite interval of Z, the 
number of runs of deleted terms will change from some 
value R to R\ where 



(25) 



r=l 



The distribution of number of events per run will also 
change from n to tt\ where 



jt'(r) = 



Rn{r)+& n {r) 
R' 



(26) 



and where the mean of the number of deleted genes 
per run increases from u to u'> so that 



_, Ru + J2x=A,B,C. l D. l E. ^ X 



(27) 



The mean 1/ of the new distribution p' of run lengths 
of undeleted terms satisfies 



v = — (u + v) — u' . 



(28) 



The new proportion 6' of undeleted terms is 
T/l{u f + i/\ 

In the same interval of Z, we define the change 3 T (r) in 
the number of runs containing single copy genes in both 
chromosomes with r = 1, 2, .... 



5r(l) = 0. 



(29) 



5 T (2) = (p Bl2 +p Bl2 +pc 12 +pc 21 )n{l) - {p B + pc + 2p D + 2p E )TT{2)r{2). (30) 

For r > 2, 

*r(r) = (Pb + Pc)ir{r - l)r(r- 1) + (p Bl2 +p Bl2 +Pc 12 + Pc 21 );r (r - 1)(1 - r(r - 1)) 

r-2 

+ (2p D + 2p £ ) 7r(5)7r(r - s - 1)(1 - (0? + 0 2 3 )[1 - r(r - s - 1)][1 - r(s)]) (31) 
- (p B + pc + 2p D + 2p £ )r(r)jT(r). 

In the implementation, the number of runs of deleted 
terms with genes on both chromosomes will change 



from T(r) to T\r), where 
T'{r) = T{r) + S z {r). 



(32) 



The proportions of runs with deletion events from 
both chromosomes will also change from r to z', where 



r'(r) 



T{r) 
R'ji'(r) ' 



(33) 



We implement equations (1) to (33) as a recurrence 
with a step size parameter A to control the number of 
events using the same p Al p B , p c , p D > Pe> S„(-) and d T (-) 
between successive normalizations, and using Ad n (-) and 
A^ T (-) instead of S n (*) and ^ T (-) in (25)-(33). The choice 
of A determines the trade-off between computing speed 
and accuracy. 

Figure 4 shows the results of our current implementa- 
tion of our deterministic recurrence for the cases ^ = 2 
and ft - 11, for unbiased fractionation ((p = 0.5) and for 
extremely biased fractionation ((p = 1). The results fit 
simulations of the stochastic model quite well and reveal 
a number of tendencies. One is that unbiased fractiona- 
tion with small deletions leads to the fastest drop in 
events of type A as 6 decreases. 

Biased fractionation with large deletion sizes leads to 
slow initial growth in the proportions of events of types 
D and E and "other". 

There are at least two reasons for the discrepancies 
between the simulations and the recurrences observed 
in Figure 4. At the outset, since we used a large step 
size A for the computationally costly recurrence, its tra- 
jectory lags behind the simulation, especially with 
respect to the slower decrease in p A and slower increase 
in p B + pc> Later discrepancies are partially due to not 
accounting for the merger of three or more runs. These 
can be estimated and are summarized as "other " in the 
diagram, but the quantities involved are not fed back to 
the recurrence through (26). 

Other possible sources of error might be due to the 
cutoffs in x used for calculations involving y(x) and p(x). 
However, extensive testing of various cutoff values has 
indicated such errors to be negligible in our 
implementation. 

Conclusions 

We have developed a model for the fractionation pro- 
cess based on deletion events excising a geometrically- 
distributed number of contiguous paralogs from either 
one of a pair of homeologous chromosomes. The exis- 
tence of data prompting this model is due to a func- 
tional biological constraint against deleting both copies 
of a duplicate pair of genes. 

The mathematical framework we propose should 
eventually serve for testing the geometric excision 
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hypothesis against alternatives such as single gene-by- 
gene inactivations, although we have not developed this 
in this paper. In addition, further developments could 
treat the gene-by-gene inactivation model as the null 
hypothesis, and the geometric excision model, with 
mean greater than 1, as the alternative hypothesis. 

Simulations of these models indicate the feasibility of 
estimating the mean ft of the deletion event process and 
the fractionation bias (p from observations of the length 
of runs of single-copy genes and the overall proportion 
of single-copy genes. 

The main question we have explored is the exact deri- 
vation of 7T, the distribution of the number of deletion 
events contributing to a run of single-copy genes. The 
simulations are convenient in practice, since they 
depend on only the parameters ^ and q> as they evolve 
over time, but they give little mathematical insight. Our 
most important advance is a deterministic recurrence 
for the n(r) as the proportion 6 of undeleted genes 
decreases. This takes into account the appearance of 
new runs over time, the lengthening of existing runs, as 
well as the merger of two existing runs with the new 



deletions to form a single, longer one. This calculation 
fits the process as simulated rather well and seems pro- 
mising for further development. 

In order to validate our fractionation model empiri- 
cally, we will have to expand it to incorporate the rear- 
rangement events that are pervasive in genome 
evolution. Our previous work on this problem shows 
that the effect of rearrangement is to seriously bias the 
observable, credible instances of fractionation towards 
smaller runs of deleted genes [4,8]. Future work on this 
difficult problem will have either to rely on careful mod- 
eling of this ascertainment bias or else find a way to 
incorporate into the model deleted runs that have been 
interrupted by rearrangements. 
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